Advanced mixtures for complex high dimensional data: from model-based to Bayesian non-parametric inference
Abstract
Cluster analysis of complex data is an essential task in statistics and machine learning. One of the most popular approaches to cluster analysis is based on mixture models. It includes mixture model-based clustering to partition individuals, or possibly variables, into groups; block mixture model-based clustering to simultaneously associate individuals and variables to clusters, that is, co-clustering, generally for independent multivariate data; and regression mixture model-based clustering approaches for conditionally dependent data, including functional data. These model-based (co-)clustering approaches rely on density estimation, possibly Bayesian, of parametric finite (block) mixtures. Such parametric models may be limited in some situations, namely for representing complex and realistic data sets. Another issue in the finite mixture model-based approach is that of model selection, which is in general equivalent to selecting the number of mixture components. The Bayesian non-parametric alternative, based on Dirichlet Process mixtures (DPM) or infinite mixtures, provides a principled way to overcome these issues. The aim of this project is to propose new Bayesian non-parametric mixtures and block mixtures for model-based (co-)clustering of high dimensional multivariate data, and new regression mixture model-based approaches for clustering complex functional data. I will perform my research within the Probability and Statistics team of the "Laboratoire Paul-Painlevé" UMR CNRS 8524, one of the leading teams worldwide in this domain.

1 Scientific context and objectives

1.1 Scientific context

The problem of complex data analysis is a central topic of modern statistical, computer and information sciences, and is connected to both theoretical and applied parts of these sciences, as well as to several application domains including pattern recognition, signal processing, bio-informatics, data mining, complex systems modeling, etc. The analysis of complex data in general implies the development of statistical models and autonomous learning algorithms that aim at acquiring knowledge from raw data for analysis and interpretation, and at making accurate decisions and predictions for future data. The development of such models therefore has to rely on a well-established statistical background, with efficient algorithmic tools, to deal with the problems raised by data complexity, including heterogeneity, missing information, high dimensionality, big volume, dynamical structure, etc. To ensure the reliability of models and algorithms for decisions made from raw data, it is important to understand the processes generating such data. From a statistical learning perspective, this in general leads to generative learning approaches, which will be at the core of this project. Generative model-based approaches are indeed very useful and well-established statistical models that make explicit the processes generating the data. They are very suitable in many contexts, in particular the unsupervised context, when the supervision (e.g., expert information) is missing, hidden or difficult to obtain, and are useful for many applications, including clustering and classification of grouped data.
Latent data models, including (Bayesian) mixture model-based approaches, are among the most popular and successful generative unsupervised learning approaches. They are widely used, in particular, in cluster analysis for automatically finding clusters, in discriminant analysis when the classes are dispersed, as well as in regression when the predictor is governed by a hidden process. Such models are naturally tailored to exploratory data analysis and rely on efficient algorithmic tools such as expectation-maximization (EM) algorithms or Markov Chain Monte Carlo (MCMC) sampling techniques. These approaches have mainly been developed for multivariate data analysis. On the other hand, in many areas of application, including signal and image processing, functional imaging, handwritten text recognition, bio-informatics, etc., the analyzed data are often available in the form of discretized values of functions or curves (e.g., time series, waveforms) and surfaces (2D images, spatio-temporal data), which makes them highly structured. This "functional" aspect of the data adds difficulties compared to the case of a classical multivariate analysis. Since standard multivariate latent data models, including mixture model-based approaches, rely on a sound statistical background and are thus likely to be very beneficial for functional data analysis, there is a growing effort to adapt them to the context of functional data, including curves and surfaces. This is, however, still less well investigated than the case of multivariate (non-functional) data.

1.2 Scientific objectives

The scientific objectives of this project are therefore two-fold. The first main objective is to develop flexible mixture model-based cluster analysis approaches for high dimensional multivariate data. This includes Bayesian parametric and non-parametric mixtures for clustering, as well as for co-clustering (block mixture model-based approaches). The second main objective is to develop mixture model-based cluster and discriminant analysis approaches for functional data, including curves and surfaces. This mainly involves the development of regression mixture models for curves and surfaces. In particular, I will focus on the unsupervised learning context. The problem of building fully unsupervised approaches will be central to the project. Hence, the problem of model selection will be central in order to develop fully unsupervised techniques that automatically learn, compare and select models. I will consider the problem of model selection from two perspectives: the first one relies on model selection criteria such as BIC or Bayes Factors, while the second one considers the Bayesian non-parametric alternative, that is, the Dirichlet Process Mixture (DPM) or, equivalently, the Chinese Restaurant Process mixture (CRPM). The project is structured according to the following tasks:

1. Bayesian non-parametric mixture model-based clustering
2. Bayesian non-parametric block mixture model-based co-clustering
3. Bayesian mixture model-based functional data clustering

2 Bayesian non-parametric mixture model-based clustering

2.1 Model-based clustering

Model-based clustering (McLachlan and Basford, 1988; Celeux and Govaert, 1992, 1993; Banfield and Raftery, 1993; Fraley and Raftery, 2002; Biernacki et al., 2006), generally used in a multivariate data analysis context, is based on the finite mixture model formulation (McLachlan and Peel, 2000; Titterington et al., 1985). The data probability density function is assumed to be a finite mixture density, each component density being associated with a cluster. The problem of clustering therefore becomes the one of estimating the parameters of the assumed mixture model. The model estimation can be performed by maximizing the observed-data likelihood. This approach is referred to as the mixture (or estimation) approach, in which likelihood maximization is generally achieved via the expectation-maximization (EM) algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997) or extensions, namely stochastic extensions (Celeux et al., 1996) such as the stochastic EM (Celeux and Diebolt, 1985). Another approach is the classification approach, also referred to as the maximum classification likelihood approach, which maximizes a classification likelihood function, that is, the complete-data likelihood, by using the classification version of the EM algorithm, namely the classification EM (CEM) algorithm (Celeux and Govaert, 1992). Banfield and Raftery (1993) and Celeux and Govaert (1995) developed parsimonious GMM clustering by exploiting an eigenvalue decomposition of the group covariance matrices of the GMM, which provides a wide range of very flexible models with different clustering criteria. This decomposition leads to numerous flexible models (with different volumes, shapes and orientations), going from the simplest spherical models to the most general one, and hence is adapted to various clustering situations. The parameters of the parsimonious Gaussian mixture models may be estimated in an MLE framework by using the EM algorithm (see details in Celeux and Govaert (1995)). Fraley and Raftery (2002) also demonstrated that the parsimonious mixture model-based clustering framework provides very good results in density estimation, cluster and discriminant analyses. Recently, Biernacki and Lourme (2014) proposed new parsimonious Gaussian mixture models for clustering based on a variance-correlation decomposition of the group covariance matrices, which are stable and visualizable and have more desirable properties than the parsimonious GMMs of Banfield and Raftery (1993) and Celeux and Govaert (1995). The model parameters are estimated by a generalized EM algorithm. The MLE approach using the EM algorithm for normal mixtures may, however, fail due to singularities or degeneracies (Stephens, 1997; Fraley and Raftery, 2007).
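For illustration, the following minimal sketch (not part of the cited works) fits Gaussian mixtures by EM with the four covariance structures available in scikit-learn, a coarse analogue of the parsimonious families discussed above, and retains the configuration with the lowest BIC (the criterion discussed in Section 2.2 below); the toy data and all settings are placeholders.

```python
# Minimal sketch: EM-based model-based clustering with scikit-learn,
# selecting among covariance structures and component numbers by BIC.
# The four covariance_type options are only a coarse analogue of the
# parsimonious GMM family of Celeux and Govaert (1995).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters in dimension 2 (placeholder for real data).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 0.5, size=(100, 2))])

best_bic, best_model = np.inf, None
for covariance_type in ("spherical", "diag", "tied", "full"):
    for n_components in range(1, 6):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=covariance_type,
                              n_init=5, random_state=0).fit(X)
        bic = gmm.bic(X)  # penalized log-likelihood criterion (Schwarz, 1978)
        if bic < best_bic:
            best_bic, best_model = bic, gmm

labels = best_model.predict(X)  # MAP partition of the individuals
print(best_model.covariance_type, best_model.n_components, round(best_bic, 1))
```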
The Bayesian approach to mixture models (Stephens, 1997; Robert, 1994; Marin and Robert, 2005; Fraley and Raftery, 2007; Bensmail et al., 1997; Richardson and Green, 1997) avoids the problems associated with maximum likelihood described previously by replacing the MLE with maximum a posteriori (MAP) estimation. This is achieved by adding regularization over the model parameters via prior parameter distributions, which are implicitly assumed to be uniform in the case of MLE. The flexible eigenvalue decomposition of the group covariance matrices described previously was also exploited in Bayesian parsimonious model-based clustering, as in Bensmail et al. (1997); Bensmail and Meulman (2003); Fraley and Raftery (2007). The parameters of the Bayesian Gaussian mixture are obtained by MAP estimation, that is, by maximizing the posterior parameter distribution. The MAP estimation can still be performed by EM, namely in the case of conjugate priors where the prior distribution is only considered for the parameters of the Gaussian components, as in the Bayesian regularization for the parsimonious GMM proposed by Fraley and Raftery (2007), in which the maximization of the posterior is still performed by EM. In general, however, the common estimation approach for the Bayesian GMM described above relies on Bayesian sampling, such as Markov Chain Monte Carlo (MCMC) techniques, namely the Gibbs sampler (Neal, 1993; Diebolt and Robert, 1994; Bensmail et al., 1997; Ormoneit and Tresp, 1998; Bensmail and Meulman, 2003; Robert, 1994; Stephens, 1997).

2.2 Model selection in finite mixture model-based clustering

Finite mixture model-based clustering requires specifying the number of mixture components (i.e., clusters) and, in the case of parsimonious models, the type of the model. The main issue in this parametric setting is therefore the one of selecting the number of mixture components, and possibly the type of the model, that best fit the data. This problem can be tackled by the Normalized Entropy Criterion (NEC) (see, e.g., Biernacki et al., 1999), penalized log-likelihood criteria such as the Bayesian Information Criterion (BIC) (Schwarz, 1978), penalized classification log-likelihood criteria such as the Approximate Weight of Evidence (Banfield and Raftery, 1993) or the Integrated Classification Likelihood (ICL) (Biernacki et al., 2000), or, more generally, by using Bayes Factors (Kass and Raftery, 1995), which provide a general way to select and compare models in (Bayesian) statistical modeling, namely in mixture models.

2.3 Proposed work 1: Dirichlet process mixture model-based clustering

These parametric (Bayesian) finite mixture models may, however, not be well adapted to represent complex and realistic data sets, and the issue of model selection remains. Bayesian Non-Parametric (BNP) approaches (Robert, 1994; Hjort et al., 2010) have recently attracted much attention as alternatives for clustering. BNP mixtures, including Dirichlet Process Mixtures (DPM) (Ferguson, 1973; Antoniak, 1974; Pitman, 1995; Wood and Black, 2008; Samuel and Blei, 2012), or equivalently Chinese Restaurant Process (CRP) mixtures (Pitman, 2002; Samuel and Blei, 2012), also seen as infinite mixtures (Rasmussen, 2000), which explore the infinite limit of finite mixtures, provide a principled way to overcome the issues arising in parametric Bayesian and non-Bayesian mixture model-based clustering. These fully Bayesian approaches jointly infer the number of mixture components (i.e., clusters) and the mixture parameters from the data. A Dirichlet Process (DP) (Ferguson, 1973), denoted as DP(α, G0), is a distribution over distributions and has two parameters, the concentration parameter α > 0 and the base measure G0.
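To make the DP(α, G0) definition concrete, here is a minimal sketch of drawing an (approximate) realization G from a DP with a Gaussian base measure; it assumes the standard truncated stick-breaking construction, which is an illustration device rather than a notion used in the text above, and all numerical values are arbitrary.

```python
# Minimal sketch: approximate draw G ~ DP(alpha, G0) via truncated stick-breaking:
# pi_k = beta_k * prod_{j<k} (1 - beta_j), with beta_k ~ Beta(1, alpha),
# and atoms theta_k drawn i.i.d. from the base measure G0 (here a standard normal).
import numpy as np

rng = np.random.default_rng(1)
alpha, truncation = 2.0, 1000  # concentration and truncation level (assumptions)

betas = rng.beta(1.0, alpha, size=truncation)
sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
pi = betas * sticks                            # mixing weights, sum to ~1
theta = rng.normal(0.0, 1.0, size=truncation)  # atom locations theta_k ~ G0

# G is the discrete measure sum_k pi_k * delta_{theta_k}; sampling from G
# returns repeated atom values with positive probability (clustering property).
draws = rng.choice(theta, size=20, p=pi / pi.sum())
print(np.unique(np.round(draws, 3)))  # typically far fewer unique values than 20
```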
Assume there is a parameter θi following a distribution G, that is, θi|G ∼ G. Modeling with a DP means that we assume that the prior over G is a DP, that is, G is itself generated from a DP. The DP has two key properties (Ferguson, 1973). First, random distributions drawn from a DP are discrete; thus, there is a strictly positive probability of multiple observations taking identical values. Suppose we have a random distribution G drawn from a DP, followed by repeated draws from that random distribution. Blackwell and MacQueen (1973) introduced a Pólya urn representation of the joint distribution of the generated random variables, which shows that the DP places its probability mass on a countably infinite collection of points, also called atoms, that is, an infinite mixture of Dirac deltas (Ferguson, 1973; Samuel and Blei, 2012):

$$G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k^\star}, \qquad \theta_k^\star \sim G_0, \quad k = 1, 2, \ldots,$$

where $\pi_k$ is the probability assigned to the $k$th atom, with $\sum_{k=1}^{\infty} \pi_k = 1$, and $\theta_k^\star$ is the location or value of that atom. These atoms are drawn independently from the base measure G0. Hence, the parameters generated from a DP exhibit a clustering property: they share repeated values with positive probability, and the unique values shared among the variables are independent draws from the base distribution G0 (Ferguson, 1973; Samuel and Blei, 2012). The DP therefore provides a very interesting tool from a clustering perspective. Consider a set of observations (x1, . . . , xn) to be clustered. Clustering with the DP adds a third step to the DP: we assume that the random variables xi, given the distribution parameters θi, are generated from a distribution f(·|θi). This is the DP mixture (DPM) (Antoniak, 1974; Escobar, 1994; Pitman, 1995; Wood and Black, 2008; Fox, 2009; Samuel and Blei, 2012). It has been shown that, as the number of components K tends to infinity, the finite Bayesian mixture model described previously converges to a Dirichlet process mixture model (Ishwaran and Zarepour, 2002; Neal, 2000; Rasmussen, 2000). Note that the DPM has an equivalent interpretation in terms of the CRP mixture (Wood and Black, 2008; Samuel and Blei, 2012). The DPM has a number of properties which make inference based on this non-parametric prior computationally tractable, by using MCMC sampling techniques (Neal, 2000; Rasmussen, 2000) or variational inference (Blei and Jordan, 2006).
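As a readily available reference point for such variational inference, scikit-learn's BayesianGaussianMixture fits a truncated DP Gaussian mixture in the spirit of Blei and Jordan (2006); a minimal sketch with toy data follows, where the truncation level and concentration value are arbitrary assumptions.

```python
# Minimal sketch: variational inference for a (truncated) Dirichlet process
# Gaussian mixture. n_components is only an upper bound (truncation level);
# superfluous components receive posterior weights close to zero.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3.0, 0.7, size=(150, 2)),
               rng.normal(0.0, 0.7, size=(150, 2)),
               rng.normal(3.0, 0.7, size=(150, 2))])  # toy data, 3 clusters

dpgmm = BayesianGaussianMixture(
    n_components=15,                                  # truncation level
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                   # DP concentration (assumption)
    covariance_type="full", max_iter=500, random_state=0,
).fit(X)

# The effective number of clusters is inferred from the data: count the
# components with non-negligible posterior weight instead of fixing K.
print(np.sum(dpgmm.weights_ > 0.01), "effective components")
labels = dpgmm.predict(X)
```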
In Chamroukhi et al. (2014b,a), we developed Bayesian non-parametric parsimonious mixtures that revisit the finite Bayesian and non-Bayesian parsimonious mixtures (Banfield and Raftery, 1993; Celeux and Govaert, 1995; Bensmail et al., 1997; Bensmail and Meulman, 2003; Fraley and Raftery, 2007) from the non-parametric perspective. These infinite parsimonious Gaussian mixture models have shown great flexibility in cluster analysis. A first step of the project will therefore consist in applying these models to model-based discriminant analysis of dispersed classes, that is, infinite mixture discriminant analysis. Then, I will formulate the recently proposed MLE-based approach of Biernacki and Lourme (2014) in a Bayesian framework. I will first focus on parametric Bayesian (MAP) estimation with model selection by Bayes Factors, and then consider a fully Bayesian non-parametric modeling, which will result in a Dirichlet Process parsimonious mixture based on the variance-correlation decomposition of the group covariance matrices. Then, after considering these models for continuous data, I will examine the infinite mixture of multinomial distributions and its application to topic modeling.

3 Bayesian mixture model-based co-clustering

3.1 Model-based co-clustering

The model-based clustering techniques described previously aim at providing a partition of the data into homogeneous groups of individuals, or possibly of variables. Model-based co-clustering (Govaert and Nadif, 2003, 2008, 2013), also called bi-clustering or block clustering, aims at simultaneously clustering the data into homogeneous blocks, a block being a simultaneous association of individuals and variables. These approaches rely on "block" mixture models (Govaert and Nadif, 2013) and have been developed for binary data (Govaert and Nadif, 2003, 2008; Keribin et al., 2012), categorical data (Keribin et al., 2014), contingency tables (Govaert and Nadif, 2003, 2006, 2008) and continuous data (Lomet, 2012; Govaert and Nadif, 2013). The block mixture can be estimated by a block CEM algorithm for maximum classification likelihood and hard co-clustering (Govaert and Nadif, 2003, 2006, 2008), or by a block (variational) EM algorithm for maximum likelihood estimation and fuzzy co-clustering (Govaert and Nadif, 2006). The Lille team has also contributed to this quite new topic, namely by developing publicly available software (Bhatia et al., 2014).

3.2 Bayesian model-based co-clustering

The block mixture models have then been examined from a Bayesian perspective to deal with some problems encountered in the MLE approach. Namely, Keribin et al. (2010) proposed a stochastic technique for the latent block model for binary data, by combining a stochastic EM with Gibbs sampling. Recently, in Keribin et al. (2012, 2014), the authors proposed, for the Bayesian formulation of the latent block mixture for respectively binary and categorical data, variational Bayesian inference and Gibbs sampling techniques.

3.3 Model selection in block mixture model-based co-clustering

In model-based co-clustering approaches, the problem of model selection in general consists in selecting the best number of blocks (co-clusters). The most common penalized log-likelihood criteria, such as BIC, AIC, etc., cannot be directly used for the block mixture models. Approximations, namely variational ones, are needed, such as approximated ICL or BIC-like criteria as in (Lomet et al., 2012b,a; Lomet, 2012). Keribin et al. (2012) proposed a Bayesian sampling algorithm to derive ICL and BIC criteria for model selection in the context of binary data. Then, Keribin et al. (2014) developed a Bayesian inference technique using MCMC for the latent block model for categorical data, and an exact ICL criterion for model selection.
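To fix ideas before turning to the proposed work, the following toy sketch illustrates hard co-clustering with a Gaussian latent block model by alternating row and column reassignments, in the spirit of the block CEM algorithm cited above; it assumes known numbers of row and column clusters and unit noise variance, and is a didactic simplification rather than the algorithms of Govaert and Nadif.

```python
# Toy sketch of hard co-clustering (block CEM spirit) for a Gaussian latent
# block model with known block numbers (g, m). Didactic simplification only.
import numpy as np

def block_cem(X, g, m, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(g, size=n)   # row-cluster labels
    w = rng.integers(m, size=d)   # column-cluster labels
    for _ in range(n_iter):
        # Block means mu[k, l] over the current blocks (empty blocks left at 0).
        mu = np.zeros((g, m))
        for k in range(g):
            for l in range(m):
                block = X[np.ix_(z == k, w == l)]
                if block.size:
                    mu[k, l] = block.mean()
        # C-step for rows: assign each row to the cluster minimizing squared error.
        row_cost = np.stack([((X - mu[k][w]) ** 2).sum(axis=1) for k in range(g)])
        z = row_cost.argmin(axis=0)
        # C-step for columns, symmetrically.
        col_cost = np.stack([((X - mu[:, l][z][:, None]) ** 2).sum(axis=0)
                             for l in range(m)])
        w = col_cost.argmin(axis=0)
    return z, w

# Synthetic data with a planted 2 x 2 block structure.
rng = np.random.default_rng(3)
means = np.array([[0.0, 3.0], [3.0, 0.0]])
z0, w0 = rng.integers(2, size=60), rng.integers(2, size=40)
X = means[np.ix_(z0, w0)] + rng.normal(scale=0.5, size=(60, 40))
z, w = block_cem(X, g=2, m=2)
```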
3.4 Proposed work 2: Bayesian mixture model-based co-clustering

These difficulties, namely in model selection, encountered in the (Bayesian) block mixture models may be tackled from a Bayesian non-parametric perspective. Indeed, the Dirichlet Process mixture (DPM) described previously is very flexible and can be extended to the case of a Dirichlet Process block mixture. The second part of my work in this project will consist in proposing a Dirichlet Process block mixture to tackle the problem of model-based co-clustering from a fully Bayesian perspective. One way is to consider a generative model involving a couple of Dirichlet Processes as prior distributions over the mixture parameters, dedicated respectively to the individuals and the variables. Thus, sampling from the coupled DPs should exhibit a clustering property over both the individuals and the variables. The inference should be achieved by MCMC sampling, namely Gibbs sampling.

4 Model-based functional data clustering

4.1 Regression mixture model-based functional data clustering

Most of the statistical analyses in model-based clustering described above are dedicated to multivariate data analysis. However, in many application domains, the individuals are by nature functions (e.g., curves). The statistical methods for analyzing such data belong to the functional data analysis (FDA) framework (Ramsay and Silverman, 2005). Mixture model-based functional data clustering includes regression mixtures using polynomials, splines or B-splines (e.g., see Gaffney and Smyth, 1999; Gaffney, 2004; Chamroukhi, 2010), random effects polynomial regressions (Gaffney and Smyth, 2004; James and Sugar, 2003; Liu and Yang, 2009), wavelets (Antoniadis et al., 2013; Devijver, 2014b), or mixtures of functional principal components (Jacques and Preda, 2014). The EM algorithm is generally used to estimate the parameters of the regression mixture by maximizing the observed-data log-likelihood, as illustrated in the sketch at the end of this section. Recently, Devijver (2014a) considered the problem of regression mixtures for high-dimensional data and proposed a lasso-like penalty with an EM algorithm for parameter estimation and variable selection.

4.2 Model selection for functional data analysis

The common penalized likelihood criteria used for multivariate data clustering, such as BIC and AIC, have also been used in this context (Liu and Yang, 2009; Samé et al., 2011; Nguyen et al., 2014). For high-dimensional data, the problem of model estimation and model selection can be addressed by using penalized criteria (e.g., the lasso) as in Devijver (2014a). In general, proposing sparse models to learn from possibly sparse functions is of great interest, namely for clustering (James and Sugar, 2003).
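As the illustration announced in Section 4.1, here is a minimal EM sketch for a polynomial regression mixture of curves observed on a common grid; the polynomial degree, the number of components and the isotropic noise model are simplifying assumptions, and this is not the project's algorithm.

```python
# Minimal EM sketch for a polynomial regression mixture over n curves observed
# on a common grid t (length T). Assumptions: fixed K and degree p, common
# isotropic noise variance per component.
import numpy as np

def regression_mixture_em(Y, t, K=2, p=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    B = np.vander(t, p + 1)                # T x (p+1) polynomial design matrix
    beta = rng.normal(size=(K, p + 1))     # regression coefficients per component
    sigma2 = np.ones(K)                    # per-component noise variances
    pi = np.full(K, 1.0 / K)               # mixing proportions
    for _ in range(n_iter):
        # E-step: posterior probabilities tau[i, k] of curve i in component k.
        logp = np.stack([-0.5 * (((Y - B @ beta[k]) ** 2).sum(axis=1) / sigma2[k]
                                 + T * np.log(2 * np.pi * sigma2[k]))
                         for k in range(K)], axis=1) + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)
        tau = np.exp(logp)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: since B is shared, weighted least squares reduces to an
        # ordinary fit of the tau-weighted mean curve of each component.
        for k in range(K):
            Wk = tau[:, k]
            ybar = (Wk[:, None] * Y).sum(axis=0) / Wk.sum()
            beta[k] = np.linalg.solve(B.T @ B, B.T @ ybar)
            resid2 = ((Y - B @ beta[k]) ** 2).sum(axis=1)
            sigma2[k] = (Wk @ resid2) / (Wk.sum() * T)
        pi = tau.mean(axis=0)
    return tau.argmax(axis=1), beta, pi

# Usage on toy curves: two mean shapes plus noise.
rng = np.random.default_rng(4)
t = np.linspace(0, 1, 50)
Y = np.vstack([1 + 2*t - 3*t**2 + rng.normal(scale=0.1, size=(30, 50)),
               -1 + np.sin(2*np.pi*t) + rng.normal(scale=0.1, size=(30, 50))])
labels, beta, pi = regression_mixture_em(Y, t, K=2, p=3)
```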
4.3 Proposed work 3: regression mixture model-based clustering for functional data

The standard regression mixture models do not address the problem of non-stationary (dynamical) curves, typically curves with regime changes. We proposed probabilistic generative models for curve modeling and segmentation, and for cluster and discriminant analyses (Chamroukhi et al., 2009, 2010; Samé et al., 2011; Chamroukhi et al., 2013a,b; Chamroukhi, 2014a). These dynamical mixture models, based on hidden process regression, highlighted the importance of using latent variables and latent processes to model complex shaped (dynamical) functional data. EM algorithms are at the core of these approaches. Recently, I proposed a fully unsupervised approach for regression mixture models based on polynomial, spline or B-spline basis decompositions (Chamroukhi, 2013, 2014b). It consists in an EM-like algorithm maximizing a penalized likelihood of the regression mixture model, which simultaneously infers the model parameters and the number of mixture components as the learning proceeds. I propose to continue in this direction and to develop a fully unsupervised learning approach for functional data based on other bases, such as wavelet-based transformations (Antoniadis et al., 2013; Devijver, 2014b). I also plan to examine the use of regression mixtures, including hidden process regression mixtures, for high dimensional data, namely to perform model and feature selection in a fully unsupervised way. In particular, the robust EM-like algorithm (Chamroukhi, 2013, 2014b) may be combined with a variable selection procedure such as the lasso one in Devijver (2014b), or used to automatically learn and infer the number of mixture components in the mixture of functional principal components (Jacques and Preda, 2014).

5 Integration into the Lille Probability and Statistics team

I plan to spend the scientific stay within the Probability and Statistics team of the mathematics laboratory "Laboratoire Paul-Painlevé", UMR CNRS 8524, Université Lille 1. During my research, I plan to collaborate mainly with Pr. Christophe Biernacki, a pioneer in the domain. The expected duration of each of the three proposed work tasks is four months. I therefore plan to spend one year at the Laboratoire Paul-Painlevé.